Understanding inverse document frequency: on theoretical arguments for IDF
Abstract
The term-weighting function known as IDF was proposed in 1972 and has since been used extremely widely, usually as part of a TF*IDF function. It is often described as a heuristic, and many papers have sought to establish a theoretical basis for it, some of them drawing on Shannon’s Information Theory. Some of these attempts are reviewed, and it is shown that the Information Theory approaches are problematic, but that there are good theoretical justifications of both IDF and TF*IDF in the traditional probabilistic model of information retrieval.
Reprinted from: Journal of Documentation 60 no. 5, pp 503–520
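For orientation, the weighting discussed in the abstract can be sketched in a few lines of Python. This is a minimal illustration that assumes the common textbook form idf(t) = log(N / n_t), where N is the number of documents and n_t the number of documents containing term t. The function names, the toy corpus, and the choice to give unseen terms a weight of zero are assumptions made here for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def idf(term, docs):
    """Inverse document frequency, log(N / n_t), over a list of token lists."""
    n_docs = len(docs)
    n_containing = sum(1 for doc in docs if term in doc)
    if n_containing == 0:
        # Assumption: a term that occurs in no document gets weight 0.
        return 0.0
    return math.log(n_docs / n_containing)


def tf_idf(term, doc, docs):
    """Raw term count in `doc` multiplied by the IDF of the term."""
    tf = Counter(doc)[term]
    return tf * idf(term, docs)


# Hypothetical three-document corpus, already tokenized.
docs = [
    ["inverse", "document", "frequency"],
    ["term", "frequency", "weighting"],
    ["probabilistic", "retrieval", "model"],
]

print(idf("frequency", docs))         # term in 2 of 3 documents
print(idf("model", docs))             # term in 1 of 3 documents
print(tf_idf("model", docs[2], docs))
```

A term that appears in fewer documents receives a larger weight, which is the behaviour the theoretical arguments reviewed in the paper set out to justify.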
Related resources
Global term weights for document retrieval learned from TREC data
A key element in modern text retrieval systems is the weighting of individual words for importance. Early in the development of document retrieval methods it was recognized that performance could be improved if weights were based at least in part on the frequencies of individual terms in the database. This observation led investigators to propose inverse document frequency weighting, which has ...
Comparative Analysis of IDF Methods to Determine Word Relevance in Web Document
Inverse document frequency (IDF) is one of the most useful and widely used concepts in information retrieval. When it is used in combination with the term frequency (TF), the result is a very effective term weighting scheme (TF-IDF) that has been applied in information retrieval to determine the weight of the terms. Terms with high TF-IDF values imply a strong relationship with the document the...
An Investigation of Indexing on the WWW
We propose a model that assists in understanding indexing on the World Wide Web (WWW). This model specifies key features of indexing strategies that are currently being used. We also present an experiment assessing the validity of Inverse Document Frequency (IDF) as a term weighting strategy for WWW documents. The experiment indicates that IDF scores are not stable in the heterogeneous and dynami...
Using TF-IDF to Determine Word Relevance in Document Queries
In this paper, we examine the results of applying Term Frequency Inverse Document Frequency (TF-IDF) to determine what words in a corpus of documents might be more favorable to use in a query. As the term implies, TF-IDF calculates values for each word in a document through an inverse proportion of the frequency of the word in a particular document to the percentage of documents the word appear...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
Journal title:
- Journal of Documentation
Volume 60, Issue 5
Pages 503–520
Publication date 2004